Relationship modeling of encode quality and encode parameters based on source attributes

ABSTRACT

A source quality of a source video and a source content complexity of the source video are identified. Parameter constraints with respect to parameters of an operation are received. The source video quality, source content complexity, and parameter constraints are applied to a deep neural network (DNN) producing DNN outputs. In an example, the DNN outputs are combined using domain knowledge to provide predicted filter parameters to a filter chain, such that applying the filter chain to the input source video results in an output video achieving a full-reference video quality score. In another example, the DNN outputs are combined using domain knowledge to produce an overall predicted quality score of the output video, without accessing the output video.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. provisional application Ser. No. 63/055,945 filed Jul. 24, 2020, the disclosure of which is hereby incorporated in its entirety by reference herein.

TECHNICAL FIELD

Aspects of the disclosure generally relate to automated relationship modeling of encode quality and encode parameters based on source video inputs.

BACKGROUND

Objective video quality assessment (VQA) methods predict a perceptual quality of a video, targeted at reproducing or best approximating human visual subjective quality assessment of the video. Depending on the availability of a perfect-quality pristine original video as the reference, VQA methods may be classified into full-reference (FR), reduced-reference (RR), and no-reference (NR) methods. FR methods assume the reference video is fully accessible, RR methods assume that the reference video is partially available in the form of pre-computed features rather than video pixels, and NR methods (sometimes referred to as blind methods) do not assume availability of the reference video.

SUMMARY

In one or more illustrative examples, a method is provided for predicting a full-reference video quality analysis of a source video that is to be modified via scaling, transcoding, and/or other filters. The method includes identifying a source video quality of the source video; identifying a source content complexity of the source video; receiving target output video parameters to be applied to the source video; applying the source video quality, source content complexity, and target output video parameters to a deep neural network (DNN) producing DNN outputs; and combining the DNN outputs using domain knowledge to produce an overall predicted quality score of an output video created by applying the target output video parameters to the source video.

In one or more illustrative examples, a method is provided of predicting bitrate, codec, resolution, or other filter parameters for a filter chain to achieve a full reference video quality score for encoding an input source video. The method includes identifying a source quality of the source video; identifying a source content complexity of the source video; receiving parameter constraints with respect to the parameters; applying the source video quality, source content complexity, and parameter constraints to a deep neural network (DNN) producing DNN outputs; and combining the DNN outputs using domain knowledge to provide the filter parameters, as predicted, to a filter chain, such that applying the filter chain to the input source video results in an output video achieving the full reference video quality score.

In one or more illustrative examples, a method is provided of predicting a full-reference video quality score of a source video after performance of scaling, transcoding, and/or filtering operations. The method includes identifying a source video quality of the source video; identifying a source content complexity of the source video; receiving content parameters of the source video; receiving player metrics indicative of aspects of playback by a consumer device of an output video corresponding to the source video; receiving parameter constraints with respect to parameters of the output video; applying the source video quality, the source content complexity, the content parameters, the parameter constraints, and the player metrics to a deep neural network (DNN) producing DNN outputs; and combining the DNN outputs using domain knowledge to produce an overall predicted quality score of the output video, without accessing the output video.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of a video-on-demand system utilizing a recommendation engine for the determination of predicted video quality;

FIG. 2 illustrates an example of an over-the-top system utilizing a prediction engine for the prediction of video quality;

FIG. 3 illustrates an example 3D surface plot of a relationship between video content encoding factors;

FIG. 4 illustrates an example process for use of a deep neural network inference model to compute a predicted output quality score;

FIG. 5 illustrates an embodiment of a deep neural network inference model used to compute a predicted output quality score;

FIG. 6 illustrates yet another embodiment of the present disclosure, showing domain-knowledge model computation and knowledge aggregation processes;

FIG. 7 illustrates the framework and data flow diagram of scale/resolution video decomposition followed by per-resolution channel deep neural network computations and domain-knowledge driven combination;

FIG. 8 illustrates the framework and data flow diagram of spatiotemporal decomposition followed by per-spatiotemporal channel deep neural network computations and domain-knowledge driven combination;

FIG. 9 illustrates the framework and data flow diagram of content analysis based video decomposition followed by per-content type deep neural network computations and domain-knowledge driven combination;

FIG. 10 illustrates the framework and data flow diagram of distortion analysis based video decomposition followed by per-distortion type deep neural network computations and domain-knowledge driven combination;

FIG. 11 illustrates the framework and data flow diagram of luminance-level and bit-depth based video decomposition followed by per-luminance-level and bit-depth deep neural network computations and domain-knowledge driven combination; and

FIG. 12 illustrates an example computing device for the performance of the operations of the recommendation engine predicting video quality.

DETAILED DESCRIPTION

As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention.

There are numerous choices to be made for compression of a video in terms of parameters, such as output resolution, framerate, codec, and bitrate. In addition, depending on the application, the cost of experimentation may be high. Given a goal to maintain as much of the source quality as possible, an unguided approach to identifying a combination of output parameters may be time consuming and costly. Additionally, one set of output parameters may only work for certain input types or certain files, increasing the time required to search for optimal parameters.

An inference model may be constructed to predict a resulting video quality based on various input attributes in conjunction with various output parameters. These input attributes may include, for example: resolution, framerate, codec (if applicable), bitrate, quality, and/or complexity. The output parameters may include: resolution, framerate, codec, and/or bitrate. Further aspects of the inference modeling are described in detail herein.
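
By way of illustration only, the following Python sketch shows one way the input attributes and output parameters above could be organized as inputs to such a model; the class and field names are hypothetical and are not terms from this disclosure.

    from dataclasses import dataclass

    @dataclass
    class SourceAttributes:
        resolution: tuple        # (width, height) of the source video
        framerate: float         # frames per second
        codec: str               # source codec, if applicable
        bitrate_kbps: float
        quality_score: float     # no-reference source quality estimate
        complexity_score: float  # content complexity estimate

    @dataclass
    class OutputParameters:
        resolution: tuple
        framerate: float
        codec: str
        bitrate_kbps: float

    # The inference model maps (SourceAttributes, OutputParameters) to a
    # predicted full-reference quality score of the not-yet-encoded output.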

FIG. 1 illustrates an example of a video-on-demand (VOD) system 100 utilizing a recommendation engine 112 for the determination of predicted video quality 115. A VOD system 100 generally allows a user to stream content, either through a traditional set-top box or through remote devices such as computers, tablets, and smartphones, by requesting the content rather than waiting for the content to be broadcast according to a schedule. As generally shown, an objective quality analysis 104 is performed on a source video 102. If the source video 102 is of a suitable quality, the source video 102 is applied to a recommendation engine 112 (along with customer constraints 114) to produce a recommendation output 116 having a predicted video quality 115. The recommendation output 116 is used to control a video encoder 118, which encodes the source video 102 into an encoded video 120 of the predicted video quality 115.

The source video 102 may include, as some examples, a live video feed from a current event, a prerecorded show or movie, and/or an advertisement or other clip to be inserted into another video feed. The source video 102 may include just video in some examples, but in many cases the source video 102 further includes additional content such as audio, subtitles, and metadata information descriptive of the content and/or format of the video. The source video 102 may be provided in various video formats, for example, serial digital interface (SDI), transport stream, multicast Internet Protocol (IP), or mezzanine files from content producers/providers.

The video encoders 118 may receive the source video 102 from the sources. The video encoders 118 may be located at a head-end of a VOD video transmission pipeline in an example. The video encoders 118 may include electronic circuits and/or software configured to compress the source video 102 into a format that conforms with one or more standard video compression specifications. The output may be referred to as encoded video 120. Examples of video encoding formats include MPEG-2 Part 2, MPEG-4 Part 2, H.264 (MPEG-4 Part 10), HEVC, Theora, RealVideo RV40, VP9, and AV1. In many cases, the encoded video 120 lacks some information present in the original source video 102, which is referred to as lossy compression. A consequence of this is that the encoded video 120 may have a lower quality than the original, uncompressed source video 102.

In some cases, the video encoders 118 may perform transcoding operations to re-encode the video content from a source format, resolution, and/or bit depth into an instance of video content with a different format, resolution, and/or bit depth. In many examples, the video encoders 118 may be used to create, for each received instance of source video 102 content, a set of time-aligned video streams, each with a different bitrate and frame size. This set of video streams may be referred to as a ladder or compression ladder. It may be useful to have different versions of the same video streams in the ladder, as downstream users may have different bandwidth, screen size, or other constraints.

Spatial information refers to aspects of the information within a frame, such as textures, highlights, etc. Temporal information refers to aspects of the information between frames, such as motion or other differences between frames. In video encoding, the more complex the spatial and temporal content of the source video 102, or even of a specific title, scene, or frame, the worse the quality of the encoded video will be perceived by a viewer when the same amount of bitrate is used during the encoding. However, encoding the source video 102 using a higher bitrate may require additional bandwidth to transmit the video. One solution is to use an encoding ladder to produce multiple different encodes of the content. The ladder may include several encoding configurations or profiles outlining a spectrum of bitrate/resolution combinations used to encode video content. In some cases, multiple adaptive bitrate (ABR) ladders may be used for the same content, for example for different input stream quality levels (e.g., low quality, high quality, etc.), for different output stream quality levels (e.g., low quality service, high quality premium service, etc.), for supporting end user devices that use different decoders, for different output resolutions (e.g., 144p, 240p, 360p, 480p, 720p, 1080p), etc. The video encoders 118 may create, for each received instance of source video 102 content, a set of time-aligned video streams, each having a different bitrate and resolution according to the ladder.
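
As a purely hypothetical illustration of such a ladder, the profiles below sketch one possible set of bitrate/resolution rungs; the specific values are invented for this example and are not taken from this disclosure.

    # Hypothetical ABR ladder: one profile per rung, all time-aligned.
    ladder = [
        {"resolution": (1920, 1080), "bitrate_kbps": 6000},
        {"resolution": (1280, 720),  "bitrate_kbps": 3000},
        {"resolution": (854, 480),   "bitrate_kbps": 1500},
        {"resolution": (640, 360),   "bitrate_kbps": 800},
        {"resolution": (426, 240),   "bitrate_kbps": 400},
    ]
    # A player selects among rungs based on bandwidth, screen size, or
    # other device constraints.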

The encoded video 120 may be used for further purposes, once encoded. This may include video consumption and/or storage 122. For example, one or more packagers may have access to the ladders for each of the instances of encoded video 120. The packagers may create segmented video files to be delivered to clients that then stitch the segments together to form a contiguous video stream. The segmented video may include video fragments, as well as a manifest that indicates how to combine the fragments. A user may then choose among the available ladder encodings based on bandwidth or other device requirements.

Significantly, in such a VOD system 100, it may be difficult to obtain information with respect to the ultimate perceived quality of the video being provided to the consumer. However, such quality information may be beneficial to have in order to best perform the compression of the source video 102.

Thus, as further shown in the VOD system 100, a quality analysis 104 may be performed on the source video 102. Quality of experience (QoE) of a video, as used herein, relates to mapping human perceptual QoE onto an objective scale, i.e., the average score given by human subjects when expressing their visual QoE when watching the playback of a video content. For example, a score may be defined on a scale of 0-100, which can be evenly divided into five quality ranges of bad (0-19), poor (20-39), fair (40-59), good (60-79), and excellent (80-100), respectively. One example objective QoE score is the SSIMPLUS score. It should be noted that the quality analysis 104 may be a no-reference algorithm, as there may be no other version of the source video 102 to use for comparison.
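
The banding of the 0-100 scale described above can be expressed directly in code; the following sketch simply encodes the five ranges given in this paragraph.

    def qoe_category(score):
        """Map a 0-100 objective QoE score to the five quality ranges:
        bad (0-19), poor (20-39), fair (40-59), good (60-79),
        and excellent (80-100)."""
        for threshold, label in ((80, "excellent"), (60, "good"),
                                 (40, "fair"), (20, "poor")):
            if score >= threshold:
                return label
        return "bad"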

If the result of the quality analysis 104 is a low-quality video indication 106 (e.g., a score below a predefined value or range), then default processing 110 may occur, without review by the recommendation engine 112. This may be done, for example, to avoid basing recommendations on video that is of too low a quality to produce good results when modeled. For instance, a minimum quality of the video may serve as a gatekeeper to further processing of the video. The default processing 110 may include various approaches, such as processing of the source video 102 without a recommendation, or rejection of the source video 102 from further processing more generally. If, however, the result of the quality analysis 104 is a suitable-quality video indication 108 (e.g., a score at, within, or above a predefined value or range of quality score), the source video 102 may be applied as an input to the recommendation engine 112.

The recommendation engine 112 may also receive customer constraints 114. These customer constraints 114 may include, for instance, quality assurances for the resultant ladders. As one specific example, this may include an assurance that the output ABR ladder is at most 10% worse for a highest profile, 15% worse for a second highest profile, and so on, up to 30% worse for a lowest profile. The customer constraints 114 may include other constraints as well, such as restrictions on what codecs to use, resolutions to use, maximum bitrate, etc.

The recommendation engine 112 may utilize this information to determine a recommendation output 116 based on an inference model. (Example inference models are discussed in detail below with respect to FIGS. 4-11.) This recommendation output 116 may be used to meet a predicted video quality 115 as inferred by the model. The recommendation output 116 may, accordingly, indicate to the video encoder 118 the parameters to use to encode the source video 102 into encoded video 120 meeting the customer constraints 114. For instance, the recommendation output 116 may include a recommendation for the entire ladder in terms of resolution, bitrate, codec, etc.

FIG. 2 illustrates an example of an over-the-top (OTT) system 200 utilizing a prediction engine 220 for the prediction of video quality. In general, an OTT system 200 is a streaming media service offered directly to viewers via the Internet, bypassing more traditional cable, broadcast, and satellite television platforms (such as the VOD system 100 discussed above). For such an OTT system 200, it may be desirable to measure the quality at a consumer device 214. However, many OTT systems 200 employ digital rights management (DRM) to ensure integrity of the video delivery chain. Because of this, it may be difficult to measure video stream quality at or close to the consumer device 214. Nor can the quality be inferred simply by using player metrics 226 alone (e.g., buffering events, initial stalling events, profile changes from one profile of an ABR ladder to another, etc.). Instead, as shown, identifying what output parameters customers are employing may provide a best-case scenario view of what the video quality may be at the consumer end. From there, and further accounting for the player metrics 226, a prediction of what the end-user video quality would be may be performed based on processing the source video 102.

More specifically, the source video 102 may be provided to a content entry point 202, resulting in the reception of received video 204. This received video 204 may be provided to a DRM platform 206 for encoding of the received video 204 using DRM. This results in DRM-encoded video 208. The DRM-encoded video 208 may then be packaged and provided to one or more origins of a content delivery network (CDN) 210. The origins refer to locations at which video content enters the content delivery network 210. In some cases, the packagers serve as origins to the content delivery network 210, while in other cases, the packagers push the video fragments and manifests into the origins. The content delivery network 210 may include a geographically-distributed network of servers and data centers configured to provide the video content (including the DRM-encoded video 208 content, as shown) from the origins as consumer video 212 to consumer devices 214. The consumer devices 214 may include, as some examples, televisions or other video screens, tablet computing devices, and/or mobile phones. The consumer devices 214 may execute a video player to validate the device, remove the DRM, and play back the content. These varied consumer devices 214 may have different viewing conditions (including illumination and viewing distance, etc.), spatial resolutions (e.g., SD, HD, full-HD, UHD, 4K, etc.), frame rates (15, 24, 30, 60, 120 frames per second, etc.), and dynamic ranges (8 bits, 10 bits, and 12 bits per pixel per color, etc.).

The received video 204 may also be provided for processing by a quality analysis 216. This may be done, for instance, as discussed with respect to the quality analysis 104. Moreover, in addition to or in the alternative of generation of a quality score 218, the quality analysis 216 may include the computation of a content complexity score 218. There are various complexity metrics, such as the sum of absolute differences (SAD) of the pixel values within a frame. A video frame that is all the same color may have a SAD of 0, whereas if the pixel values alternated between the minimum and maximum values, the SAD would be a function of the resolution. Complexity can act as a proxy for how well an encoder will preserve the quality at a target bitrate.
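
As a minimal sketch, assuming SAD here means the summed absolute difference between neighboring pixel values within a frame (a reading consistent with the flat-frame and alternating-pixel examples above), the metric could be computed as follows.

    import numpy as np

    def frame_sad(frame):
        """SAD of a single grayscale frame: sum of absolute differences
        between horizontally and vertically adjacent pixels. A flat frame
        scores 0; a frame alternating between min and max values scores
        in proportion to its resolution."""
        f = frame.astype(np.int64)  # avoid unsigned-integer wraparound
        return int(np.abs(np.diff(f, axis=1)).sum() +
                   np.abs(np.diff(f, axis=0)).sum())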

The quality score/complexity score 218 may be provided to a prediction engine 220. The prediction engine 220 may be configured to predict the quality of the source video 102 at the consumer device 214, modeled using the quality score/complexity score 218 as well as other information such as content parameters 222 (e.g., resolution, bitrate, codec, etc., of the received video 204). Using these sources of information, and an inference model (e.g., as discussed with respect to FIGS. 4-11), the prediction engine 220 may infer a consumer site predicted quality 224 score. This consumer site predicted quality 224 score and other sources of information, such as the player metrics 226 (e.g., buffering events, profile changes, etc.), may then be provided to a second quality analysis 228 to generate a consumer quality score 230. The prediction engine 220 may also utilize information with respect to the target consumer video 212 parameters post transcode to aid in the prediction in combination with the player metrics 226. For instance, if a pristine 4K source transcoded to 500 kbps at 360p resolution is being played on a 4K TV (in an example), then those target consumer video 212 parameters should additionally be accounted for, as they may have a significant effect on the consumer quality score 230. In sum, this consumer quality score 230 may be indicative of the quality of the consumer video 212 as played back at the consumer device 214, despite the consumer video 212 not being available for analysis.

Inference models, such as those utilized in the context of the VOD system 100 and the OTT system 200, may therefore be used to infer a predicted quality of a video content while lacking access to the video content itself (or even without the creation of the encoded video). To do so, a model may be created that defines a relationship where, if given a set of input parameters and a target output quality value, the output resolution, framerate, codec, and bitrate can be inferred. To reduce the search space, certain parameters of the model may be fixed to help find the values that would be optimal for the remaining unknowns.

The model may be configured to operate on a per-sequence or per-title workflow, predicting what the best output values would be for a certain segment of video. This model can be trained for different types of video to classify the input to further reduce the search space going forward. For instance, if the user would like to specify parameters for a sports program versus a talking-head program, the search space may be tuned for one content type versus the other.

FIG. 3 illustrates an example 3D surface plot 300 of a relationship between video content encoding factors; specifically, the relationship between content complexity, bitrate, and content quality score. As shown, increased quality generally correlates with increased bitrate. For a given input complexity and output bitrate, the output quality can therefore be inferred according to that relationship. This collapses several dimensions, but when certain dimensions are fixed (e.g., codec, resolution, etc.), the predicted output quality may then be determined. Using this and other relationships between video factors, the size of the search space to be modeled may be reduced.
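
For illustration, a sampled version of such a surface could be queried by interpolation once the other dimensions are fixed; the sample points below are invented for this sketch and do not come from FIG. 3.

    import numpy as np
    from scipy.interpolate import griddata

    # Hypothetical (complexity, bitrate_kbps, quality) samples standing in
    # for a measured surface at a fixed codec and resolution.
    samples = np.array([
        [0.2, 1000, 85.0],
        [0.2, 4000, 95.0],
        [0.8, 1000, 55.0],
        [0.8, 4000, 80.0],
    ])

    def predict_quality(complexity, bitrate_kbps):
        """Linearly interpolate the quality surface at the query point."""
        # rescale=True because the two axes have very different units
        return float(griddata(samples[:, :2], samples[:, 2],
                              (complexity, bitrate_kbps),
                              method="linear", rescale=True))

    print(predict_quality(0.5, 2500))  # a value between the sample corners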

A deep neural network (DNN) may be used to perform the inference from the quality score/complexity score 218 and content parameters 222 to the predicted quality 224. Notably, this predicted quality 224 output is not a full reference score of the current video. Instead, the predicted quality 224 is a predicted full reference score of the unobtained output video having had the user parameters applied to it.

FIG. 4 illustrates an example process for use of a deep neural network (DNN) inference model to compute a predicted output quality score 420. As shown, output parameters are received at operation 402, to which a parameter analysis 404 is performed. Moreover, a viewing device/condition analysis operation 406 is performed to obtain viewing device and viewing condition parameters. This is followed by a human visual system (HVS) modeling operation 408 that takes the above analysis steps as input parameters. A video input 410 whose quality is to be assessed is given. The video input 410 is applied to a no-reference signal quality analysis 412 and/or to a content complexity analysis 414. One or multiple deep neural networks (DNNs) 416 are then applied to the decomposed signals at multiple channels. Finally, a combination operation 418 combines the analysis and modeling results with the DNN 416 outputs to produce a predicted output quality score 420 for the video input 410. Significantly, this predicted output quality score 420 is a predicted full-reference score of unobtained output video that has the user parameters applied to it, not a quality score of the video input 410 itself.

FIG. 5 illustrates an embodiment of a deep neural network (DNN) inference model used to compute a predicted output quality score 526. As shown, an input video 500 passes through a signal decomposition 502 process that transforms the signal into multiple channels (N channels in total).

For each channel, a deep neural network (DNN) 504, 506, 508 is used to produce a channel-specific quality prediction (N DNNs in total). This prediction may be, for instance, in the form of a scalar quality score or a quality parameter vector. The signal decomposition results also aid in the analysis of the input video 500 in the content analysis process 510. The distortion analysis process 512 is then applied to identify the distortions and artifacts in the input video 500. Viewing device parameters 514 and viewing condition parameters 518 may be obtained separately and used for HVS modeling 516 and viewing device analysis 520 processes. An aggregation process 522 collects all the information from the outputs of the content analysis process 510, the distortion analysis process 512, the HVS modeling 516, and the viewing device analysis 520, and performs an aggregation to provide aggregate data used to guide the combination process 524 of all DNN outputs, in turn producing a predicted output quality score 526 of the input video 500.
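
One simple instance of such a guided combination, consistent with the weighted-average option described elsewhere herein, is sketched below; this is an illustrative sketch, not the disclosed implementation.

    import numpy as np

    def combine_channel_scores(channel_scores, channel_weights):
        """Combine N per-channel DNN quality scores into one predicted
        score using importance weights supplied by the aggregation of
        domain knowledge (e.g., HVS-derived channel relevance)."""
        w = np.asarray(channel_weights, dtype=float)
        w = w / w.sum()  # normalize so the weights sum to 1
        return float(np.dot(w, np.asarray(channel_scores, dtype=float)))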

FIG. 6 illustrates yet another embodiment of the present disclosure, showing domain-knowledge model computation and knowledge aggregation processes. As shown, the video input 600 is first fed into a content analysis module 602. The content analysis module 602 may perform content analysis, including, for instance, content type classification and content complexity assessment. In an example, the video content of the video input 600 may be classified into one or more of the categories of sports, animation, screen content, news, show, drama, documentary, action movie, advertisement, etc. The content may also be classified based on signal activities or complexities. For example, based on the video's spatial information content (strength and spread of fine texture details, sharp edge features, and smooth regions), temporal information content (amount and speed of camera and object motion), color information content (diversities in hue and saturation), and/or noise level (camera noise, film grain noise, synthesized noise, etc.), the video input may be classified into high, moderate, or low complexity categories for each of these criteria.

The video input 600 is also provided through a distortion analysis module 604, where the distortions and visual artifacts in the video input 600 are detected and the distortion levels are evaluated. The causes of distortions may include different types of lossy video compression (such as MPEG-1, MPEG-2, MPEG-4, H.261, H.263, H.264/AVC, H.265/HEVC, DV, VC-1, AV1, VPx, AVSx, FVC, VVC, Motion JPEG, Motion JPEG2000, Pro-Res, Theora, and other types of image/video compression standards) and errors occurring during image acquisition, encoding, decoding, transmission, color space conversion, color sampling, spatial scaling, denoising, contrast enhancement, frame rate change, color and dynamic range tone mapping, and rendering. The appearance of visual artifacts may include blur, blocking, banding, ringing, noise, color shift, skin tone shift, color bleeding, exposure shift, contrast shift, highlight detail loss, shadow detail loss, texture loss, fake texture, flickering, jerkiness, jittering, floating, etc. The distortion analysis process may detect and quantify one or more of these artifacts, or produce visibility probability estimations of each of the visual artifacts.

The viewing condition parameters 606 may be obtained separately from the video input 600. The viewing condition parameters 606 may include the viewing distance and lighting condition of the viewing environment. They are used by the HVS modeling module 608 to quantify the visibility of distortions and artifacts. The computational HVS models of the HVS modeling module 608 may incorporate the contrast sensitivity function (CSF) of the visual system, which measures the human visual signal, contrast, or error sensitivity as a function of spatial and temporal frequencies, and may be a function of the luminance of the display and viewing environment. The HVS model may also incorporate visual luminance masking, which measures the visibility variation of signals due to surrounding luminance levels. The HVS model may also incorporate visual contrast/texture masking, which measures the reduction of distortion/artifact visibility according to the strength and contrast of signals nearby in terms of spatial and temporal location, spatial and temporal frequency, and texture structure and orientation. The HVS model may also incorporate visual saliency and attention models, which estimate the likelihood/probability of each spatial and temporal location in the video attracting visual attention and fixations. The HVS model may also incorporate visibility models of specific artifacts such as blur, blocking, banding, ringing, noise, color shift, skin tone shift, color bleeding, exposure shift, contrast shift, highlight detail loss, shadow detail loss, texture loss, fake texture, flickering, jerkiness, jittering, floating, etc.
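
By way of example only, one classic CSF approximation that such HVS modeling could draw upon is the Mannos-Sakrison model; this disclosure does not commit to any particular formula, so the following is merely illustrative.

    import numpy as np

    def csf_mannos_sakrison(f):
        """Contrast sensitivity as a function of spatial frequency f in
        cycles per degree (Mannos & Sakrison, 1974). Sensitivity peaks
        at mid frequencies and falls off at low and high frequencies."""
        return 2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1)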

The viewing device parameters 610 may also be obtained separately from the video input 600. The viewing device parameters 610 may include device type and model, screen size, video window size, resolution, brightness, bit depth, and contrast ratio. These parameters are used by the viewing device analysis module 612 for device category classification, and are fed into the HVS modeling module 608 as input.

The results of the content analysis module 602, distortion analysis module 604, HVS modeling module 608, and viewing device analysis module 612 are collected by the knowledge aggregation module 614 according to the aggregation process 522, which outputs aggregated domain knowledge 616 to be combined with data-driven DNN results (e.g., via the combination process 524 of FIG. 5 and the combination operation 418 of FIG. 4).

FIG. 7 illustrates the framework and data flow diagram of scale/resolution video decomposition followed by per-resolution channel DNN computations and domain-knowledge driven combination. Here, the signal decomposition 502 method may be a scale/resolution decomposition 702 that transforms the video input 700 into multi-scale or multi-resolution representations, e.g., as Res 1 (element 704), Res 2 (element 706), . . . , Res N (element 708) as shown. Examples of the decomposition methods include Fourier transforms, the discrete cosine transform (DCT), the discrete sine transform (DST), the wavelet transform, the Gabor transform, the Haar transform, the Laplacian pyramid transform, the Gaussian pyramid transform, the steerable pyramid transform, and other types of frequency decomposition, spatial-frequency decomposition, multi-scale decomposition, and multi-resolution decomposition methods.
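
As a minimal sketch of one such decomposition, the following builds a Gaussian pyramid by repeated blur-and-downsample; a production system would more likely rely on a library implementation.

    import numpy as np

    def gaussian_pyramid(frame, levels):
        """Multi-resolution decomposition of a grayscale frame: a 5-tap
        binomial blur followed by 2x decimation at each level."""
        kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

        def blur(img):
            # separable blur: filter each row, then each column
            img = np.apply_along_axis(np.convolve, 1, img, kernel, mode="same")
            return np.apply_along_axis(np.convolve, 0, img, kernel, mode="same")

        pyramid = [np.asarray(frame, dtype=float)]
        for _ in range(levels - 1):
            pyramid.append(blur(pyramid[-1])[::2, ::2])
        return pyramid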

The multi-scale multi-resolution representations are fed into a series of DNNs 710, 712, 717, and their outputs are combined using a knowledge-driven approach 718 that is guided by domain knowledge 716, resulting in a final quality score 720 of the video input. An example of the domain knowledge 716 used here is the importance or weights created through the HVS modeling module 608 that predicts the visual relevance of each of the multi-scale multi-resolution representations.

FIG. 8 illustrates the framework and data flow diagram of spatiotemporal decomposition followed by per-spatiotemporal channel DNN computations and domain-knowledge driven combination. Here, the signal decomposition 502 method may be a spatiotemporal decomposition 802 that transforms the video input 800 into multiple spatiotemporal channel representations, e.g., as ST 1 (element 807), ST 2 (element 806), . . . , ST N (element 808) as shown. Examples of the decomposition methods include 3D Fourier transforms, 3D DCT, 3D wavelet transform, 3D Gabor transform, 3D Haar transform, 3D Laplacian and Gaussian pyramid transforms, and other types of spatial-temporal-frequency and 3D oriented decomposition methods. These transforms or decompositions may be applied to multiple consecutive frames, or a group-of-pictures (GoP).
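
As an illustrative sketch, a 3D DCT over a group of pictures could be computed as follows; partitioning the resulting coefficient volume into bands would then yield the spatiotemporal channels.

    import numpy as np
    from scipy.fft import dctn

    def gop_3d_dct(gop):
        """3D DCT of a group of pictures shaped (frames, height, width).
        Low-index coefficients capture slow spatial/temporal variation;
        high-index coefficients capture fine detail and fast motion."""
        return dctn(np.asarray(gop, dtype=float), type=2, norm="ortho")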

The spatiotemporal channel representations are fed into a series of DNNs 810, 812, 817, and their outputs are combined using a knowledge-driven approach 818 that is guided by domain knowledge 816, resulting in a final quality score 820 of the video input. An example of the domain knowledge 816 used here is the importance or weights created through spatiotemporal HVS modeling via the HVS modeling module 608 that predicts the visual relevance of each of the spatiotemporal channel representations.

FIG. 9 illustrates the framework and data flow diagram of content analysis based video decomposition followed by per-content type DNN computations and domain-knowledge driven combination. Here, the signal decomposition 502 method may be a content type decomposition 902 that transforms the video input 900 into multiple representations, e.g., as C-Type 1 (element 907), C-Type 2 (element 906), . . . , C-Type N (element 908) as shown. One example of the decomposition method is to classify and segment the scenes or frames of the video into different content categories, such as sports, animation, screen content, news, show, drama, documentary, action movie, advertisement, etc. Another example of the decomposition method is to classify and segment the scenes or frames of the video into different content complexity categories, such as high, moderate, and low complexity categories in terms of one or more of the video's spatial information content (strength and spread of fine texture details, sharp edge features, and smooth regions), temporal information content (amount and speed of camera and object motion), color information content (diversities in hue and saturation), and/or noise level (camera noise, film grain noise, synthesized noise, etc.).

The C-Type representations are fed into a series of DNNs 910, 912, 917, and their outputs are combined using a knowledge-driven approach 918 that is guided by domain knowledge 916, resulting in a final quality score 920 of the video input. An example of the domain knowledge 916 used here is the importance and/or weights created through the content analysis module 602 that predicts the likelihood of the content types and the importance of each content type in the overall quality assessment.

FIG. 10 illustrates the framework and data flow diagram of distortion analysis based video decomposition followed by per-distortion type DNN computations and domain-knowledge driven combination. Here, the signal decomposition 502 method may be a distortion type decomposition 1002 that transforms the video input 1000 into multiple representations, e.g., as D-Type 1 (element 1007), D-Type 2 (element 1006), . . . , D-Type N (element 1008) as shown. One example of the decomposition method is to segment the videos into scenes or groups of pictures (GoPs), each of which is associated with an assessment of the likelihoods of containing each of a list of distortion types. Such distortion types may include one or more of blur, blocking, banding, ringing, noise, color shift, color bleeding, skin tone shift, exposure shift, contrast shift, highlight detail loss, shadow detail loss, texture loss, fake texture, flickering, jerkiness, jittering, floating, etc.

The D-Type representations are fed into a series of DNNs 1010, 1012, 1017, and their outputs are combined using a knowledge-driven approach 1018 that is guided by domain knowledge 1016, resulting in a final quality score 1020 of the video input. An example of the domain knowledge 1016 used here is the importance and/or weights created through the distortion analysis module 604 that predicts the likelihood of the distortion types and the importance of each distortion type in the overall quality assessment.

FIG. 11 illustrates the framework and data flow diagram of luminance-level and bit-depth based video decomposition followed by per-luminance-level and bit-depth DNN computations and domain-knowledge driven combination. Here, the signal decomposition 502 method may be a luminance level and bit-depth decomposition 1102 that transforms the video input 1100 into multiple representations, e.g., as luminance level (LL) 1 (element 1107), LL 2 (element 1106), . . . , LL N (element 1108) as shown. One example of the decomposition method is to segment the video scenes or frames into different regions, each of which is associated with a range of luminance levels or bit-depths.
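
A minimal sketch of such a segmentation for 8-bit video follows; the band edges are hypothetical.

    import numpy as np

    def luminance_bands(frame, edges=(0, 64, 160, 256)):
        """Partition an 8-bit grayscale frame into region masks by
        luminance range (hypothetical shadow/midtone/highlight bands);
        each region would feed its own per-luminance-level DNN."""
        return [(frame >= lo) & (frame < hi)
                for lo, hi in zip(edges[:-1], edges[1:])]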

The LL representations are fed into a series of DNNs 1110, 1112, 1117, and their outputs are combined using a knowledge-driven approach 1118 that is guided by domain knowledge 1116, resulting in a final quality score 1120 of the video input. An example of the domain knowledge 1116 used here is the importance and/or weights created through the viewing device analysis module 612, the HVS modeling module 608, and the distortion analysis module 604 that assess the importance of each luminance level or bit-depth in the overall quality assessment.

These various inference models may be used to infer a predicted quality of a video content while lacking access to the video content itself. For example, the models may be used to generate a recommendation output 116 to ensure that the video is encoded at a predicted video quality 115 while meeting the customer constraints 114. As another example, the models may be used to form a prediction of end-user video quality accounting for the player metrics 226. Significantly, these determinations may be performed based on processing of the source video 102, without use of the end-user video.

FIG. 12 illustrates an example computing device 1200 for the performance of the operations of the recommendation engine predicting video quality. The algorithms and/or methodologies of one or more embodiments discussed herein, such as those illustrated with respect to FIGS. 1-11, may be implemented using such a computing device 1200. The computing device 1200 may include memory 1202, processor 1204, and non-volatile storage 1206. The processor 1204 may include one or more devices selected from high-performance computing (HPC) systems including high-performance cores, microprocessors, micro-controllers, digital signal processors, microcomputers, central processing units, field programmable gate arrays, programmable logic devices, state machines, logic circuits, analog circuits, digital circuits, or any other devices that manipulate signals (analog or digital) based on computer-executable instructions residing in memory 1202. The memory 1202 may include a single memory device or a number of memory devices including, but not limited to, random access memory (RAM), volatile memory, non-volatile memory, static random-access memory (SRAM), dynamic random access memory (DRAM), flash memory, cache memory, or any other device capable of storing information. The non-volatile storage 1206 may include one or more persistent data storage devices such as a hard drive, optical drive, tape drive, non-volatile solid-state device, cloud storage, or any other device capable of persistently storing information.

The processor 1204 may be configured to read into memory 1202 and execute computer-executable instructions residing in program instructions 1208 of the non-volatile storage 1206 and embodying algorithms and/or methodologies of one or more embodiments. The program instructions 1208 may include operating systems and applications. The program instructions 1208 may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation, and either alone or in combination, Java, C, C++, C#, Objective C, Fortran, Pascal, JavaScript, Python, Perl, and PL/SQL.

Upon execution by the processor 1204, the computer-executable instructions of the program instructions 1208 may cause the computing device 1200 to implement one or more of the algorithms and/or methodologies disclosed herein. The non-volatile storage 1206 may also include data 1210 supporting the functions, features, and processes of the one or more embodiments described herein. This data 1210 may include, as some examples, the source video input, content parameters, player metrics, quality scores, content complexity scores, domain knowledge, models, and predicted quality scores.

The processes, methods, or algorithms disclosed herein can be deliverable to/implemented by a processing device, controller, or computer, which can include any existing programmable electronic control unit or dedicated electronic control unit. Similarly, the processes, methods, or algorithms can be stored as data and instructions executable by a controller or computer in many forms including, but not limited to, information permanently stored on non-writable storage media such as ROM devices and information alterably stored on writeable storage media such as floppy disks, magnetic tapes, CDs, RAM devices, and other magnetic and optical media. The processes, methods, or algorithms can also be implemented in a software executable object. Alternatively, the processes, methods, or algorithms can be embodied in whole or in part using suitable hardware components, such as Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), state machines, controllers or other hardware components or devices, or a combination of hardware, software and firmware components.

While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms encompassed by the claims. The words used in the specification are words of description rather than limitation, and it is understood that various changes can be made without departing from the spirit and scope of the disclosure. As previously described, the features of various embodiments can be combined to form further embodiments of the invention that may not be explicitly described or illustrated. While various embodiments could have been described as providing advantages or being preferred over other embodiments or prior art implementations with respect to one or more desired characteristics, those of ordinary skill in the art recognize that one or more features or characteristics can be compromised to achieve desired overall system attributes, which depend on the specific application and implementation. These attributes can include, but are not limited to, cost, strength, durability, life cycle cost, marketability, appearance, packaging, size, serviceability, weight, manufacturability, ease of assembly, etc. As such, to the extent any embodiments are described as less desirable than other embodiments or prior art implementations with respect to one or more characteristics, these embodiments are not outside the scope of the disclosure and can be desirable for particular applications.

With regard to the processes, systems, methods, heuristics, etc. described herein, it should be understood that, although the steps of such processes, etc. have been described as occurring according to a certain ordered sequence, such processes could be practiced with the described steps performed in an order other than the order described herein. It further should be understood that certain steps could be performed simultaneously, that other steps could be added, or that certain steps described herein could be omitted. In other words, the descriptions of processes herein are provided for the purpose of illustrating certain embodiments, and should in no way be construed so as to limit the claims.

Accordingly, it is to be understood that the above description is intended to be illustrative and not restrictive. Many embodiments and applications other than the examples provided would be apparent upon reading the above description. The scope should be determined, not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. It is anticipated and intended that future developments will occur in the technologies discussed herein, and that the disclosed systems and methods will be incorporated into such future embodiments. In sum, it should be understood that the application is capable of modification and variation.

All terms used in the claims are intended to be given their broadest reasonable constructions and their ordinary meanings as understood by those knowledgeable in the technologies described herein unless an explicit indication to the contrary is made herein. In particular, use of the singular articles such as “a,” “the,” “said,” etc. should be read to recite one or more of the indicated elements unless a claim recites an explicit limitation to the contrary.

The abstract of the disclosure is provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in various embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separately claimed subject matter.

While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the invention.

What is claimed is:
1. A method of predicting a full-reference video quality score of a source video after performance of scaling, transcoding, and/or filtering operations, the method comprising: identifying a source video quality of the source video; identifying a source content complexity of the source video; receiving target output video parameters to be applied to the source video; applying the source video quality, the source content complexity, and the target output video parameters to a deep neural network (DNN) producing DNN outputs; and combining the DNN outputs using domain knowledge to produce an overall predicted quality score of an output video created by applying the target output video parameters to the source video.
2. The method of claim 1, further comprising, in obtaining the domain knowledge, performing content analysis by classifying the source video into different content type categories and/or classifying the source video into different complexity categories.
3. The method of claim 1, further comprising, in obtaining the domain knowledge, performing distortion analysis by detecting different distortion types in the source video and classifying the source video based on distortion type categories or estimating likelihoods of the distortion types.
4. The method of claim 1, further comprising, in obtaining the domain knowledge, performing HVS modeling by using viewing condition and device parameters.
5. The method of claim 4, further comprising incorporating human visual contrast sensitivity function, luminance masking, contrast masking, texture masking, visual attention, and fixation properties into the HVS modeling.
6. The method of claim 1, further comprising, in obtaining the domain knowledge, performing viewing device analysis using viewing device parameters.
7. The method of claim 1, further comprising aggregating content analysis, distortion analysis, HVS modeling, and viewing device analysis into the domain knowledge.
8. The method of claim 1, further comprising using one or more of average, weighted average, feedforward neural networks, or support vector regression approaches to combine the DNN outputs and the domain knowledge to produce the overall predicted quality score.
9. The method of claim 1, further comprising: using a scale or resolution decomposition to transform the source video into multi-scale multi-resolution representations; passing the multi-scale multi-resolution representations into the DNN; and combining the DNN outputs using the domain knowledge based on HVS modeling, wherein the HVS modeling predicts a visual relevance of each of the multi-scale multi-resolution representations.
10. The method of claim 9, further comprising using one or more of Fourier transforms, a discrete cosine transform (DCT), a discrete sine transform (DST), a wavelet transform, a Gabor transform, a Haar transform, a Laplacian pyramid transform, a Gaussian pyramid transform, or a steerable pyramid transform to perform the decomposition into the multi-scale multi-resolution representations.
11. The method of claim 1, further comprising: using a spatiotemporal decomposition to transform the source video into multiple spatiotemporal channel representations; passing the spatiotemporal channel representations into the DNN; and combining the DNN outputs using the domain knowledge based on spatiotemporal HVS modeling that predicts a visual relevance of each of the spatiotemporal channel representations.
12. The method of claim 11, further comprising using one or more of 3D Fourier transforms, a 3D DCT, a 3D wavelet transform, a 3D Gabor transform, a 3D Haar transform, a 3D Laplacian pyramid transform, or a 3D Gaussian pyramid transform for performing the spatiotemporal decomposition.
13. The method of claim 1, further comprising: using a content type decomposition to transform the source video into content type representations in terms of segments of different content categories and/or complexity categories; passing the content type representations into the DNN; and combining the DNN outputs using the domain knowledge based on content analysis that predicts a likelihood and importance of the content categories and/or complexity categories.
14. The method of claim 13, further comprising, with respect to the content type decomposition, classifying the source video into high, moderate, and low complexity categories in terms of one or more of spatial information content, temporal information content, color information content, and/or noise level of the source video.
15. The method of claim 1, further comprising: using a distortion type decomposition to transform the source video into distortion type representations in terms of video segments each associated with likelihoods of containing each of a list of distortion types; passing the distortion type representations into the DNN; and combining the DNN outputs using the domain knowledge based on distortion analysis that predicts a likelihood and importance of each distortion type.
16. The method of claim 1, further comprising: using a luminance level and bit-depth decomposition to transform the source video into multiple luminance level representations; passing the luminance level representations into the DNN; and combining the DNN outputs using the domain knowledge based on viewing device analysis, HVS modeling, and distortion analysis that assess an importance of each luminance level or bit-depth.
17. A method of predicting bitrate, codec, resolution, or other filter parameters for a filter chain to achieve a full reference video quality score for encoding an input source video, the method comprising: identifying a source video quality of the source video; identifying a source content complexity of the source video; receiving parameter constraints with respect to the parameters; applying the source video quality, the source content complexity, and the parameter constraints to a deep neural network (DNN) producing DNN outputs; and combining the DNN outputs using domain knowledge to provide the filter parameters, as predicted, to the filter chain, such that applying the filter chain to the input source video results in an output video achieving the full reference video quality score.
18. The method of claim 17, wherein the filter chain includes a series of operations to be performed on the source video, the series of operations including one or more of a rescaling operation or a transcoding operation.
19. The method of claim 17, wherein the parameter constraints include one or more of a target output quality score, a predefined bitrate, a specified codec, or a specified resolution.
20. The method of claim 17, further comprising, in obtaining the domain knowledge, performing content analysis by classifying the source video into different content type categories and/or classifying the source video into different complexity categories.
21. The method of claim 17, further comprising, in obtaining the domain knowledge, performing distortion analysis by detecting different distortion types in the source video and classifying the source video based on distortion type categories or estimating likelihoods of the distortion types.
22. The method of claim 17, further comprising, in obtaining the domain knowledge, performing HVS modeling by using viewing condition and device parameters.
23. The method of claim 22, further comprising incorporating human visual contrast sensitivity function, luminance masking, contrast masking, texture masking, visual attention, and fixation properties into the HVS modeling.
25. The method of claim 17, further comprising, in obtaining the domain knowledge, performing viewing device analysis using viewing device parameters.
26. The method of claim 17, further comprising aggregating content analysis, distortion analysis, HVS modeling, and viewing device analysis into the domain knowledge.
27. The method of claim 17, further comprising using one or more of average, weighted average, feedforward neural networks, or support vector regression approaches to combine the DNN outputs and the domain knowledge to produce the overall quality score.
28. The method of claim 17, further comprising: using a scale or resolution decomposition to transform the source video into multi-scale multi-resolution representations; passing the multi-scale multi-resolution representations into the DNNs; and combining the DNN outputs using the domain knowledge based on the HVS modeling, wherein the HVS modeling predicts the visual relevance of each of the multi-scale multi-resolution representations.
29. The method of claim 28, further comprising using one or more of Fourier transforms, the discrete cosine transform (DCT), the discrete sine transform (DST), the wavelet transform, the Gabor transform, the Haar transform, the Laplacian pyramid transform, the Gaussian pyramid transform, or the steerable pyramid transform to perform the multi-scale multi-resolution decomposition.
30. The method of claim 17, further comprising: using a spatiotemporal decomposition to transform the source video into multiple spatiotemporal channel representations; passing the spatiotemporal representations into the DNNs; and combining the DNN outputs using the domain knowledge based on spatiotemporal HVS modeling that predicts the visual relevance of each of the spatiotemporal channel representations.
31. The method of claim 30, further comprising using one or more of 3D Fourier transforms, the 3D DCT, the 3D wavelet transform, the 3D Gabor transform, the 3D Haar transform, the 3D Laplacian pyramid transform, or the 3D Gaussian pyramid transform for performing the spatiotemporal decomposition.
32. The method of claim 17, further comprising: using a content type decomposition to transform the source video into content type representations in terms of segments of different content categories and/or complexity categories; passing the content type representations into the DNNs; and combining the DNN outputs using the domain knowledge based on content analysis that predicts the likelihood and importance of the content and complexity categories.
33. The method of claim 32, further comprising, with respect to the content type decomposition, classifying the source video into high, moderate, and low complexity categories in terms of one or more of spatial information content, temporal information content, color information content, and/or noise level of the source video.
34. The method of claim 17, further comprising: using a distortion type decomposition to transform the video input into distortion type representations in terms of video segments each associated with likelihoods of containing each of a list of distortion types; passing the distortion type representations into the DNNs; and combining the DNN outputs using the domain knowledge based on distortion analysis that predicts the likelihood and importance of each distortion type.
35. The method of claim 17, further comprising: using a luminance level and bit-depth decomposition to transform the video input into multiple luminance level representations; passing the luminance level representations into the DNNs; and combining the DNN outputs using the domain knowledge based on viewing device analysis, HVS modeling, and distortion analysis that assess the importance of each luminance level or bit-depth.
36. A method of predicting a full-reference video quality score of a source video after performance of scaling, transcoding, and/or filtering operations, the method comprising: identifying a source video quality of the source video; identifying a source content complexity of the source video; receiving content parameters of the source video; receiving player metrics indicative of aspects of playback by a consumer device of an output video corresponding to the source video; receiving parameter constraints with respect to parameters of the output video; applying the source video quality, the source content complexity, the content parameters, the parameter constraints, and the player metrics to a deep neural network (DNN) producing DNN outputs; and combining the DNN outputs using domain knowledge to produce an overall predicted quality score of the output video, without accessing the output video.
37. The method of claim 36, wherein the player metrics are indicative of buffering events and/or profile changes at the consumer device.
38. The method of claim 36, further comprising, in obtaining the domain knowledge, performing content analysis by classifying the source video into different content type categories and/or classifying the source video into different complexity categories.
39. The method of claim 36, further comprising, in obtaining the domain knowledge, performing distortion analysis by detecting different distortion types in the source video and classifying the source video based on distortion type categories or estimating likelihoods of the distortion types.
40. The method of claim 36, further comprising, in obtaining the domain knowledge, performing HVS modeling by using viewing condition and device parameters.
41. The method of claim 40, further comprising incorporating human visual contrast sensitivity function, luminance masking, contrast masking, texture masking, visual attention, and fixation properties into the HVS modeling.
42. The method of claim 36, further comprising, in obtaining the domain knowledge, performing viewing device analysis using viewing device parameters.
43. The method of claim 36, further comprising aggregating content analysis, distortion analysis, HVS modeling, and viewing device analysis into the domain knowledge.
44. The method of claim 36, further comprising using one or more of average, weighted average, feedforward neural networks, or support vector regression approaches to combine the DNN outputs and the domain knowledge to produce the overall predicted quality score.
45. The method of claim 36, further comprising: using a scale or resolution decomposition to transform the source video into multi-scale multi-resolution representations; passing the multi-scale multi-resolution representations into the DNN; and combining the DNN outputs using the domain knowledge based on HVS modeling, wherein the HVS modeling predicts a visual relevance of each of the multi-scale multi-resolution representations.
46. The method of claim 45, further comprising using one or more of Fourier transforms, a discrete cosine transform (DCT), a discrete sine transform (DST), a wavelet transform, a Gabor transform, a Haar transform, a Laplacian pyramid transform, a Gaussian pyramid transform, or a steerable pyramid transform to perform the decomposition into the multi-scale multi-resolution representations.
47. The method of claim 36, further comprising: using a spatiotemporal decomposition to transform the source video into multiple spatiotemporal channel representations; passing the spatiotemporal channel representations into the DNN; and combining the DNN outputs using the domain knowledge based on spatiotemporal HVS modeling that predicts a visual relevance of each of the spatiotemporal channel representations.
48. The method of claim 47, further comprising using one or more of 3D Fourier transforms, a 3D DCT, a 3D wavelet transform, a 3D Gabor transform, a 3D Haar transform, a 3D Laplacian pyramid transform, or a 3D Gaussian pyramid transform for performing the spatiotemporal decomposition.
49. The method of claim 36, further comprising: using a content type decomposition to transform the source video into content type representations in terms of segments of different content categories and/or complexity categories; passing the content type representations into the DNN; and combining the DNN outputs using the domain knowledge based on content analysis that predicts a likelihood and importance of the content categories and/or complexity categories.
50. The method of claim 49, further comprising, with respect to the content type decomposition, classifying the source video into high, moderate, and low complexity categories in terms of one or more of spatial information content, temporal information content, color information content, and/or noise level of the source video.
51. The method of claim 36, further comprising: using a distortion type decomposition to transform the source video into distortion type representations in terms of video segments each associated with likelihoods of containing each of a list of distortion types; passing the distortion type representations into the DNN; and combining the DNN outputs using the domain knowledge based on distortion analysis that predicts a likelihood and importance of each distortion type.
52. The method of claim 36, further comprising: using a luminance level and bit-depth decomposition to transform the source video into multiple luminance level representations; passing the luminance level representations into the DNN; and combining the DNN outputs using the domain knowledge based on viewing device analysis, HVS modeling, and distortion analysis that assess an importance of each luminance level or bit-depth.